Module 8 Lecture - Confidence Intervals

Analysis of Variance

Quinton Quagliano, M.S., C.S.P

Department of Educational Psychology

1 Overview and Introduction

Agenda

1 Overview and Introduction

2 A Single Population Mean Using the Normal Distribution

3 A Single Population Mean Using the Student t Distribution

4 A Population Proportion

5 Conclusion

1.1 Textbook Learning Objectives

  • Calculate and interpret confidence intervals for estimating a population mean and a population proportion.
  • Interpret the Student’s t probability distribution as the sample size changes.
  • Discriminate between problems applying the normal and the Student’s t distributions.
  • Calculate the sample size required to estimate a population mean and a population proportion given a desired confidence level and margin of error.

1.2 Instructor Learning Objectives

  • Understand the value of calculating a confidence interval in interpreting the “accuracy” of a certain statistic
  • Appreciate how the calculation and interpretation of a confidence interval builds upon our previous understanding of distributions and probability
  • Value how confidence intervals demonstrate the inherently probabilistic nature of quantitative analysis in research

1.3 Introduction

  • Important: Professional researchers often forget just how important it is to report confidence intervals in results - but they are actually very useful in interpretation!
  • Up until now, we have primarily been working with sample statistics as point estimates
    • E.g., a sample statistic of a mean (\(\bar{x}\)) is a point estimate of the population parameter mean (\(\mu\))
    • Same logic applied to standard deviation, variance, median, etc.
  • Inherently, our single point estimate of a statistic is going to be insufficient for telling us much about how close it really is to the population parameter
    • Thus, we need some other part to add to the point estimate to give us more information about the population
  • Important: Remember our goal with statistics is to tell us something about the population, not just the sample!
  • This is, in essence, getting us closer to inferential statistics, where we can actually infer meaning from the characteristics of our sample

  • The part that we need to add is called a confidence interval

    • A confidence interval is effectively a range of values that we believe a population parameter falls in, with a certain amount of confidence
    • It is actually appropriate to refer to the confidence interval itself as a random continuous variable, as it refers to a distribution of possible values from which sample statistics can be drawn
  • Discuss: As a review, try explaining, in your own words, what a 'random' variable is
  • At first, we’ll talk about applications of confidence intervals to the mean, implying that we are working with continuous variables
    • We’ll touch on more categorical, discrete type stuff later

1.4 Nuances in the Confidence Intervals

  • Commonly, we use 95 percent as the level of confidence
    • However, we can use something like 99 percent for a more conservative estimate
  • Discuss: If you have taken statistics before, what other value is commonly associated with 95 percent?
  • Also, it is important to treat the confidence interval as another type of estimate, which we usually call an interval estimate
    • Just like our point estimates, there is no guarantee that it is infallible
    • In fact, the confidence interval is best understood as a description of the reliability of sample statistics taken from the same population; it does not mean it is certain that the population parameter falls within the range
  • Discuss: For review, describe again why we can't take sample estimates as being representative of populations (hint: use the vocabulary 'sample v-----')
  • Depending on the field and the outlet, you may see confidence intervals described as something like a margin of error

1.5 Basics of Calculating Confidence Intervals

  • In order to calculate the confidence interval of a certain statistic, we need to know the standard error of the variable that we are interested in
    • Recall, this is the long-term standard deviation of the distribution that our statistic comes from
  • However, this is easiest to know when we can assume that our variable comes from a normal distribution
    • This comes back to being able to appeal to the empirical rule that was introduced as part of understanding normal distributions
    • Remember that the empirical rule is also sometimes called the 68-95-99.7 rule
  • Discuss: Based upon the description above, try re-describing what this rule says about normally distributed variables
  • Important: There are several slight variations on the empirical rule, but it is easiest to round things off to whole number standard deviations
  • In order to calculate the 95% confidence interval, we use the following formula:
    • \([PE - 2SE, PE + 2SE]\)
    • Where \(PE\) is our point estimate
    • \(SE\) is the standard error of the statistic
    • \(2SE\) is used in appeal to the empirical rule, because, theoretically, 95% of the values of the statistic fall within 2 standard deviations; we can use 1 or 3 to find the 68% or 99.7% confidence intervals, respectively
  • Important: Recall that the standard error decreases as sample size increases, which also means that a larger sample size results in a more narrow confidence interval
  • This is practically interpreted as: repeated samples of the same size (\(n\)) taken from this population would give us a statistic within this range 95% of the time

  • For example: I have a point estimate mean statistic (\(\bar{x}\)) of 100, a sample size (\(n\)) of 40, and a population parameter standard deviation (\(\sigma\)) of 10

    • Review: \(SE = \frac{\sigma}{\sqrt{n}}\)
    • Thus: \(SE = \frac{10}{\sqrt{40}} = 1.58\)
    • So 95% CIs are: \([100 - 3.16, 100 + 3.16] = [96.84, 103.16]\)
    • Practical interpretation: 95% of infinitely many sample statistics taken from this population will fall between 96.84 and 103.16
  • Discuss: Try following the same procedures for a point estimate mean of 50, a sample size of 12, and a population standard deviation of 20
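The worked example above can be checked with a few lines of Python (standard library only; the numbers are the same hypothetical \(\bar{x} = 100\), \(n = 40\), \(\sigma = 10\) from the slide):

```python
from math import sqrt

# Worked example from above: x_bar = 100, n = 40, sigma = 10
x_bar, n, sigma = 100, 40, 10

se = sigma / sqrt(n)                            # standard error of the mean
lower, upper = x_bar - 2 * se, x_bar + 2 * se   # empirical-rule 95% CI

print(round(se, 2))                      # 1.58
print(round(lower, 2), round(upper, 2))  # 96.84 103.16
```

Swapping in 1 or 3 for the multiplier gives the 68% or 99.7% intervals, per the empirical rule; the same pattern applies to the Discuss exercise.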

2 A Single Population Mean Using the Normal Distribution


2.1 Introduction

  • In this section, we’ll say more about the vocabulary and specifics used in each part of the calculation process, when the population standard deviation is known
    • You may object that we can rarely know the population SD, which is fair - we’ll get into that in a bit
    • For this scenario, we rely upon the central limit theorem
  • We choose our confidence level (CL) somewhat arbitrarily, but it is usually 90% or above; it expresses how certain we want to be that the population parameter falls within the confidence interval
    • Based upon that CL, we can calculate the error bound for a population mean (EBM), which is the amount we deviate (i.e., add or subtract) from the point estimate mean
  • Another way of conceptualizing confidence level is through alpha
    • It is just the complement of CL: \(\alpha + CL = 1\)

  • Important: For my folks who haven't taken stats before, alpha will show up again when we talk about p-values!
  • Consider my prior example:
    • I have a point estimate mean statistic (\(\bar{x}\)) of 100, a sample size (\(n\)) of 40, and a population parameter standard deviation (\(\sigma\)) of 10
      • Review: \(SE = \frac{\sigma}{\sqrt{n}}\)
      • Thus: \(SE = \frac{10}{\sqrt{40}} = 1.58\)
      • So 95% CIs are: \([100 - 3.16, 100 + 3.16] = [96.84, 103.16]\)
  • Discuss: Identify what is the CL, alpha, and EBM in this worked problem
  • Discuss: Make sure you can properly describe the difference between the standard deviation of the population and the standard error of the mean
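The CL/alpha/EBM vocabulary from this worked example can be sketched in a few lines of Python (standard library only; same hypothetical numbers as above):

```python
from math import sqrt

# Same worked example: x_bar = 100, n = 40, sigma = 10, CL = 0.95
cl = 0.95
alpha = 1 - cl              # alpha is the complement of the confidence level
se = 10 / sqrt(40)          # standard error of the mean
ebm = 2 * se                # error bound for the mean, via the empirical rule

print(round(alpha, 2))      # 0.05
print(round(ebm, 2))        # 3.16
```

The interval is then simply the point estimate plus or minus the EBM.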

2.2 Writing Z-score Notation for Confidence Intervals

  • Picture the normal distribution: we split \(\alpha\) between its two tails
    • Thus, each tail contains \(\frac{\alpha}{2}\) area, and the z-score that marks the CI bound is \(z_{\frac{\alpha}{2}}\)
    • Therefore the upper 95% CI bound z is stated as \(z_{\frac{0.05}{2}}\) or \(z_{0.025}\)
  • Discuss: Now do the same thing with a 92\% CI lower bound
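These \(z_{\frac{\alpha}{2}}\) values can be looked up with Python's standard library (`statistics.NormalDist`), rather than a printed z-table, as a quick sketch:

```python
from statistics import NormalDist

def z_alpha_over_2(cl):
    """z-score with area alpha/2 in the upper tail, i.e. the (1 - alpha/2) quantile."""
    alpha = 1 - cl
    return NormalDist().inv_cdf(1 - alpha / 2)

print(round(z_alpha_over_2(0.95), 2))  # 1.96
print(round(z_alpha_over_2(0.92), 3))  # z for the 92% CI in the Discuss prompt
```

Note the 95% value is the exact counterpart of the "2 standard deviations" in the empirical rule.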

3 A Single Population Mean Using the Student t Distribution


3.1 Introduction

  • As mentioned prior, we don’t often have what we can easily consider to be a reasonable population standard deviation
    • In fact, that may be one of the many things we are trying to estimate!
  • Discuss: Quickly address why you won't regularly have the population parameters and under what circumstances would you have them?
  • William Gosset was a staff member at the Guinness Brewery and, under the pen name of “Student,” came up with the idea of Student's t-distribution, to be used when the plain normal distribution didn’t work
    • Use of Student’s t-distribution is now the standard way to calculate confidence intervals for means

3.2 More on the t-distribution

  • Theoretical steps:
    • Draw a simple random sample of size \(n\) from a normally distributed population with mean \(\mu\) and standard deviation \(\sigma\)
    • Convert the sample means to t-scores with \(t = \frac{\bar{x} - \mu}{s / \sqrt{n}}\)
  • Results in a Student’s t-distribution with \(n - 1\) degrees of freedom (df)
    • The individual \(t\) score of a mean is understood as the distance that \(\bar{x}\) is from \(\mu\), like a z-score
  • Degrees of freedom are a tricky concept, but come up often in most inferential statistical formulas
    • When calculating standard deviation for a sample, we calculate \(n\) deviations, and because we know that the sum of all deviations is 0 (if we don’t square them, as in the variance formula), the last deviation cannot vary
    • Thus, all deviations (\(n\)) can vary until this last one, leading to \(n - 1\) as the degrees of freedom in this case
  • Important: Perhaps the most defining feature of the t-distribution that separates it from the normal distribution is that it changes with the sample size
  • The notation for the t-distribution is given as \(T \sim t_{df}\)
    • Where \(df = n - 1\), as previously discussed
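The "last deviation cannot vary" point is easy to verify numerically (plain Python, using a made-up sample of four values):

```python
# Hypothetical sample of n = 4 observations
sample = [4, 7, 9, 12]
mean = sum(sample) / len(sample)           # 8.0

# Deviations from the mean always sum to zero, so once the
# first n - 1 deviations are known, the last one is fixed:
deviations = [x - mean for x in sample]    # [-4.0, -1.0, 1.0, 4.0]
print(sum(deviations))                     # 0.0
```

Hence only \(n - 1\) of the deviations are free to vary, which is where the degrees of freedom come from.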

3.3 Calculating with t-distribution

  • In the case that we don't know the population standard deviation, we can use the t-distribution in the following way

  • \(EBM = \left(t_{\frac{\alpha}{2}}\right)\left(\frac{s}{\sqrt{n}}\right)\)

    • \(t_{\frac{\alpha}{2}}\) is the t-score with area to the right of \(\frac{\alpha}{2}\)
    • \(s\) is the sample standard deviation statistic
    • \(df\) is the degrees of freedom, \(n - 1\)
  • Important: Calculation with the t-distribution has to proceed with either a table or a calculator / computer - we'll cover what to do in SPSS
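As a sketch of that calculation in Python (assuming SciPy is available; the numbers reuse the earlier hypothetical \(\bar{x} = 100\), \(n = 40\), but now with a sample SD \(s = 10\) in place of the unknown \(\sigma\)):

```python
from math import sqrt
from scipy.stats import t

# Hypothetical sample summary: x_bar = 100, s = 10, n = 40 (population sigma unknown)
x_bar, s, n = 100, 10, 40
alpha = 0.05

t_crit = t.ppf(1 - alpha / 2, df=n - 1)  # t-score with alpha/2 in the upper tail
ebm = t_crit * s / sqrt(n)               # error bound for the mean

print(round(t_crit, 3))                  # about 2.023 for df = 39
print(round(x_bar - ebm, 2), round(x_bar + ebm, 2))
```

Because \(t_{crit} > 2\), this interval comes out slightly wider than the earlier z-based \([96.84, 103.16]\), reflecting the extra uncertainty from estimating the standard deviation.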

4 A Population Proportion


4.1 Introduction

  • Important: I'm going to abridge this part a bit, only because it gets a bit too 'in the weeds' for what we need to know
  • Often, we are interested not just in the confidence interval of means, but also of percentages and proportions

  • First, we need to identify a situation that is more proportion-based than mean-based

    • In these scenarios, the underlying distribution is the binomial distribution
  • Discuss: For review: write the notation formula to describe a binomial distribution and describe the characteristics of that distribution

4.2 Calculation of a Proportion CI

  • Because of the underlying binomial, we can take a proportion as: \(P' = \frac{X}{n}\)

    • Where:
    • \(X\) is a random variable of number of successes
    • \(n\) is the number of trials/sample size
    • \(P'\) is also sometimes shown as \(\hat{P}\)
  • In the scenario that \(n\) is particularly large and \(p\) is not close to zero or one, we can actually treat this as approximately normal: \(X \sim N(np, \sqrt{npq})\), where \(q = 1 - p\)

  • This all eventually works down to \(\frac{X}{n} = P' \sim N(\frac{np}{n}, \frac{\sqrt{npq}}{n})\)

    • The second part after the comma is then treated as the “standard” error for proportions
  • One can calculate the error bound for a proportion (EBP) to build a confidence interval for proportions, just like we did for means
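Putting those pieces together, a proportion CI can be sketched in plain Python (the counts \(X = 120\) successes out of \(n = 200\) trials are made up for illustration):

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical data: X = 120 successes in n = 200 trials
x, n = 120, 200
p_hat = x / n                     # P' = X/n
q_hat = 1 - p_hat

se = sqrt(p_hat * q_hat / n)      # "standard" error for a proportion, sqrt(pq/n)
z = NormalDist().inv_cdf(0.975)   # z-score for a 95% CI
ebp = z * se                      # error bound for the proportion

print(round(p_hat, 2))                            # 0.6
print(round(p_hat - ebp, 3), round(p_hat + ebp, 3))  # 0.532 0.668
```

The structure mirrors the mean case exactly: point estimate plus or minus (critical value times standard error).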

5 Conclusion


5.1 A Plea for Confidence Intervals

  • I am a big advocate of the value of confidence intervals in a lot of situations - I think they appropriately capture that these statistics really are just estimates/guesses!

  • In my opinion, it is pretty much always appropriate to report confidence intervals alongside the point estimates of just about anything you calculate - it increases information and transparency

5.2 Recap

  • Confidence intervals add valuable information about the relative spread and inaccuracy of sample statistics, for both means and proportions

  • Confidence intervals pretty much always become narrower with larger sample size, continuing the trends demonstrated with the central limit theorem and the law of large numbers

  • We introduced the t-distribution as a valuable way to calculate confidence intervals for means, even when lacking the population standard deviation

5.3 Lecture Check-in

  • Make sure to complete any lecture check-in tasks associated with this lecture!
